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Abstract 

Background: Food security is an issue that has come under renewed scrutiny amidst concerns that substantial 
yield increases in cereal crops are required to feed the world's booming population. Wheat is of fundamental 
importance in this regard being one of the three most important crops for both human consumption and livestock 
feed; however, increase in crop yields have not kept pace with the demands of a growing world population. In 
order to address this issue, plant breeders require new molecular tools to help them identify genes for important 
agronomic traits that can be introduced into elite varieties. Studies of the genome using next-generation 
sequencing enable the identification of molecular markers such as single nucleotide polymorphisms that may be 
used by breeders to identify and follow genes when breeding new varieties. The development and application of 
next-generation sequencing technologies has made the characterisation of SNP markers in wheat relatively cheap 
and straightforward. There is a growing need for the widespread dissemination of this information to plant 
breeders. 

Description: CerealsDB is an online resource containing a range of genomic datasets for wheat {Triticum aestivum) 
that will assist plant breeders and scientists to select the most appropriate markers for marker assisted selection. 
CerealsDB includes a database which currently contains in excess of 100,000 putative varietal SNPs, of which several 
thousand have been experimentally validated. In addition, CerealsDB contains databases for DArT markers and EST 
sequences, and links to a draft genome sequence for the wheat variety Chinese Spring. 

Conclusion: CerealsDB is an open access website that is rapidly becoming an invaluable resource within the wheat 
research and plant breeding communities. 
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Background 

The issue of food security (defined as the access to suffi- 
cient, safe, nutritious food to maintain a healthy and ac- 
tive life) has become a focus of concern for governments 
around the world and has moved steadily up the political 
agenda. The human population is expected to reach nine 
billion by 2050, and it has been estimated that cereal 
production will need to increase by 50% by this time if 
we are to meet the challenge of feeding ourselves and 
the animals upon which we are dependent [1]. The 
cereal crops wheat, maize and rice are the main source 
of nutrition for humans and domesticated animals [2]. 
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Wheat, which accounts for approximately 30% of global 
cereal consumption [3], is of fundamental importance in 
this regard with an estimated 2010 harvest of 651 mil- 
lion metric tonnes [4]. Ensuring that yields of wheat in- 
crease to meet future needs has become an important 
focus in agricultural research. The remarkable increases 
in yield experienced during the period 1950 to 2000, 
which were based on traditional breeding techniques, 
have unfortunately been unable to keep up with the 
increased demand of a rapidly expanding world popula- 
tion. In addition, the relative increase in wheat yields has 
slowed towards the end of the twentieth century [5]. To 
help address this need plant breeders and scientists must 
embrace the modern molecular biological technologies 
that allow them a more targeted approach to plant 
improvement. 



© 201 2 Wilkinson et al.; licensee BioMed Central ttd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution ficense (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Wilkinson et al. BMC Bioinformatics 2012, 13:219 
http://www.biomedcentral.com/1471 -21 05/1 3/219 



Page 2 of 6 



Single nucleotide polymorphisms (SNPs), which are sin- 
gle base pair changes in the DNA sequence, are the most 
common form of sequence variation between individuals 
of the same species. Because of their frequency in the 
genome, SNPs have become the marker of choice for 
marker assisted selection (MAS) of agronomic traits of 
importance such as disease and drought resistance [6]. 
In order to target genes at any position throughout the 
genome it is necessary to saturate the wheat genome 
with validated SNP markers that can be used in MAS. 
Through the use of next-generation sequencing (NGS), 
our laboratory is actively searching for SNPs in the wheat 
genome and developing PCR assays for these so that they 
can be used as molecular markers by plant breeders. 

The development of NGS platforms and high- 
throughput genotyping methods has made it possible to 
rapidly characterise and validate thousands of putative 
SNP markers in wheat at relatively low cost [7,8]. A 
number of online databases already exist that link to SNP 
markers and other genomic resources, such as Gramene 
which is an online data resource for comparative genome 
analysis in the grasses [9] and GrainGenes, a database for 
genetic and genomic information about Triticeae species 
and their wild relatives [10]. Whilst these sites contain a 
wealth of information, they have been designed primarily 
for scientists. In order to maximise the utility and applica- 
tion of SNP markers for plant breeders, it is imperative 
that such information is made freely available and easily 
accessible to non-experts in molecular biology. This re- 
quirement for the dissemination of validated SNP marker 
data to plant breeders has provided the impetus for the 
development of the CerealsDB SNP database. 

We have greatly extended the functionality of CerealsDB, 
which was originally constructed to store a dataset of 
26,382 EST sequences [11], to include a number of search- 
able online databases relating to wheat genomics. Princi- 
pally, the CerealsDB website is now aimed at those who 
wish to obtain information about SNP markers, for ex- 
ample, the sequence upon which they are based, the pri- 
mers used for their PCR-based identification or SNP 
haplotype information of common UK varieties. The 
CerealsDB website also allows users to search the ori- 
ginal CerealsDB EST sequences and the draft genome 
sequence of wheat variety Chinese Spring (assembled 
using Roche 454 technology) and to search for diversity 
array technology (DArT) markers that have been bin- 
mapped to chromosomes arms. 

Construction and content 

Implementation 

CerealsDB was implemented using a MySQL relational 
database management system (RDMS) database (version 
5.051b ) running on a MacOSX server and hosted using 
the Apache web server (version 2.2) with Perl and PHP 



scripts used for all data retrieval and output. MySQL 
was selected for the database, as it is one of the leading 
open source industrial strength RDMS, and matches the 
quality and performance of other major proprietary 
databases. MySQL also finds advantage in multi-layered 
server design with independent modules and fast execu- 
tion process. The Apache HTTP server was adopted due 
to its reliability, relative ease of configuration and wide 
language interface support system. 

CerealsDB is a relational collection of entities (tables) 
with appropriate attributes (fields) to define the wheat 
SNPs. An entity-relationship (ER) model for the SNP 
database is displayed in Figure 1. At present, the SNP 
database is composed of 11 tables of which four share 
an identifying relationship with the 'Contig' table 
through their common primary key, 'SNPid'. The 'Con- 
tig' table, which contains 111,442 records (representing 
putative SNP loci), consists of 16 attributes and includes 
information on those contigs that contain putative SNPs: 
e.g. source of the contig data, contig sequence and SNP 
position(s) within the sequence. The 'Contig' table also 
contains a field indicating whether the SNPs have been 
validated experimentally. 

The 'Location' table consists of 3,813 entries and con- 
tains information with regard to chromosomal location 
and position in centimorgans (cM) of all the SNPs that 
have been mapped. The 'Primers' table, which has 6,337 
records, contains the KASPar primer sequences (two 
forward and one reverse) for all the SNPs that have been 
experimentally assayed. The 'Haplotype' table includes 
6,337 records and contains the SNP calls for all the vali- 
dated SNPs against a panel of 101 wheat varieties. The 
'blast_url2' table holds 37,837 entries that are linked to 
the contig table by the 'NewContigName" attribute in a 
1:1 identifying relationship. This table contains all hits 
reported when a BLASTX similarity search was per- 
formed for each contig versus the NCBI 'embryophyta 
plant' protein database [12]. Inclusion of this table allows 
users to search for SNPs on contigs that have a particu- 
lar annotation (e.g., disease resistance). The 'Source' 
table cannot currently be queried by CerealsDB users 
and serves as an in-house resource. It contains 3,263 en- 
tries and holds information on the original source of a 
validated SNP and the location of the DNA on a sample 
plate stored in our laboratory. 

The 'Homo_and_var_SNPs' (511,439 entries) and 'Varie- 
talSNPS' tables (99,945 entries) are linked to one another 
by a 1:1 non-identifying relationship and are unlinked to 
any of the other tables. The 'HomoandvarSNPs' table 
contains information derived from a particular experi- 
ment designed to identify both homoeologous and var- 
ietal SNPs in eight wheat varieties (Alchemy, Avalon, 
Cadenza, Hereward, Rialto, Robigus, Savannah and Xil9): 
a homoeologous SNP is a sequence variation which exists 
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between the different genomes (A, B and D) which make 
up the hexaploid wheat genome and a varietal SNP exists 
between wheat varieties and enables the differentiation of 
two varieties based upon that particular SNP location. 
The table, which contains 511,439 entries, contains a pri- 
mary key which uniquely identifies the SNP and contig 
from which it has been derived, and the SNP calls, depth 
and value, for each locus for all eight varieties. This infor- 
mation has been published in Winfield et al. [13]. 

The 'VarietalSNPS' table contains information solely 
on those putative SNPs that have been classified as var- 
ietal; that is, SNPs that clearly distinguish one wheat var- 
iety from another. These varietal SNPs are of particular 
interest to breeders. This table is used principally for 
database searches used to populate the summary tables 
at the head of several web pages. 

There are three stand-alone tables that are not linked to 
any other table in the database. The 'WheatVariety' table 



contains 101 records, one entry for each variety studied, 
and includes information on pedigree, winter/spring 
habit, and the original plant breeder for the 101 wheat 
varieties studied so far. The Images table (21 records) 
contains the URL to images of SNP maps of the 21 wheat 
chromosomes, and the table 'IUPAC_codes' holds 36 en- 
tries for translation of ambiguity codes. 

The CerealsDB site has been extensively redesigned 
using cascading style-sheets (CSS) to standardise the for- 
mat of the site, and improve the presentation, to provide 
a more intuitive browsing experience. Links are provided 
to the WheatBP educational resource and to other related 
sites that provide additional background information on 
wheat genomics. 

Data sources 

SNP data was included from previous studies [14] as 
well as from ongoing experiments. SNPs were validated 
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Figure 1 Entity-relationship (ER) model for the cerealsDB SNP database, generated with MySQL workbench. The schema displays the 
relationships between different tables within the SNP database. 
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using the KBioscience Competitive Allele-Specific PCR 
(KASPar) genotyping method which relies on the differ- 
ential amplification of allele specific probes [15]. All 
SNP data were curated to remove redundancy and dupli- 
cate values. The contigs used in the database were gen- 
erated from an assembly of wheat (5x Chinese Spring 
var.) cDNA sequences. Unique IDs were created for each 
contig using an 11 -character code, a Bristol contig 
(BC) prefix for reference sequences followed by 9 digits 
(e.g., BC000000001). All unique SNPs were designated 
a 10-character code, beginning with a Bristol SNP (BS) 
prefix, followed by 8 digits (e.g., BS00000001). Cross- 
referencing and standardisation of file formats was per- 
formed using custom Perl scripts. The CerealsDB site also 
contains sequences from the draft assembly of the Chin- 
ese Spring genome, which is comprised of 5,321,847 con- 
tigs with a 5 x genome coverage. At this level of coverage, 
we would expect to have at least one read for > 95% of the 
genome. 

The KASPar SNP database currently consists of data on 
111,442 SNPs derived from 8 wheat varieties of which 
99,945 are varietal SNPs (these are the SNPs that are of 
value to plant breeders) . Of these SNPs 4,986 have been 
validated (by screening against 101 varieties) and 3,813 
mapped to a specific chromosomal location. Validated 
SNP entries contain information on location within a 
reference contig and primer sequences used for validation. 

CerealsDB includes the original wheat EST database, 
containing 26,382 expressed sequence tags that were 
generated from hexaploid winter wheat (var. Mercia) 
cDNA and previously described by Wilson et al. [11]. 
These wheat ESTs can be searched by gene name or 
BLAST. There is also a database of DArT data that can 
be accessed through a clickable wheat ideogram; each 
chromosome image of the ideogram takes you to data 
related to that particular chromosome. DArT technol- 
ogy, which was developed as a hybridization-based tool 
for genotyping, incorporates many features of AFLP ana- 
lysis combined with high throughput genetic analysis 
using a DNA microarray platform. We used deletion 
lines obtained from the Kansas Wheat Genetic and Gen- 
omic Resources Center (WGGRC) (http://www.k-state. 
edu/wgrc/Germplasm/Deletions/delindex.html) to assign 
DArT markers to chromosome arm specific bins. Of the 
420 deletion lines held at the WGGRC, 74 lines were 
selected because they were the lines used in the NSF 
Wheat EST Genomics Project designed to assign wheat 
ESTs to chromosome bins. These lines possess homolo- 
gous terminal deletions in single chromosome arms. 
These deletion lines (developed in the wheat variety Chin- 
ese Spring) provide powerful tools for the physical map- 
ping of wheat chromosomes giving us the ability to map 
markers not only to specific homoeologous chromosomes 
but to assign them to bins of around 28 Mb [16] on 



specific chromosome arms. Whilst DArT markers are no 
longer used by plant breeders (having been superseded by 
SNP markers), this information is still relevant to the 
broader scientific research community. 

Utility and discussion 

Web interface and user querying capabilities 

The CerealsDB website contains a range of datasets that 
are accessible via the menu bar and include the KASPar 
SNP database, a BLAST search page to query the Wheat 
genomic sequence, a download page for the draft wheat 
genome, a wheat EST database, a search page for wheat 
DArT data and a page describing the mapping of DArT 
markers using wheat deletion lines. 

The KASPar SNP database allows SNP name and con- 
tig queries. Some queries (e.g. SNP name) are exact. Se- 
quence queries are BLAST similarity based [17]. Some of 
the functionality of the CerealsDB site is displayed in 
Figure 2. All SNPs listed in CerealsDB have been 
assigned a unique ID and the output page for a query 
gives information about the gene name and locus. All 
validated SNP information can be searched by chromo- 
some and sub-genome and the resulting data is down- 
loadable as an Excel spreadsheet. 

The interface is designed to be simple and provide 
quick access for plant breeders who may not possess an 
in depth knowledge of genetics. Chromosome informa- 
tion is accessed via a clickable image (ideogram). These 
are arrayed on the screen according to the chromosome 
and sub-genome to which they belong. The initial page 
for the KASPar SNP database contains statistics on the 
SNPs within the database and also links to bar graphs 
showing the distribution across the 21 wheat chromo- 
somes of all mapped SNPs for the mapping populations 
Avalon x Cadenza and Rialto x Savannah. In addition, 
there are also graphs showing the relationship between 
the expected number of mapped SNPs to each of the 
three genomes (A, B and D), based on their relative 
lengths, and the actual number mapped. 

There are four main query roots for the KASPar SNP 
database: 

1) "Select primers"; this can provide information on the 
primers for individual SNPs or for all the SNPs on a 
particular chromosome. Primers for SNPs on each 
chromosome can be selected by clicking on the 
relevant chromosome image of the ideogram, or by 
entering a specific SNP ID. Alternatively, all SNPs 
located on that particular chromosome can be 
downloaded as an Excel spreadsheet. Scrolling down 
the page reveals a map of the chromosome and the 
SNPs that are currently mapped to it. 

2) "Obtain haplotypes"; select for information on the 
SNP alleles in the 101 wheat varieties against which 
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Figure 2 The functionality of the KASPar SNP database is displayed with (a) the CerealsDB homepage, from which a range of pages 
are available for querying the SNP database. For example, contigs containing SNPs can be searched by BLAST (b) and SNP maps can be 
dynamically retrieved from the database by clicking on ideograms (c) for a particular chromosome on one of the sub genomes. 



the SNPs have as yet been validated. Clicking the 
relevant chromosome image of the ideogram will 
take the user to information on that particular 
chromosome. Alternatively, haplotypes can be 
selected for particular varieties on specific sub- 
genomes and chromosomes. Once again, results can 
be downloaded as an Excel spreadsheet. 

3) "Blast contigs"; allows the user to input a nucleotide 
sequence and perform similarity searches (using 
BLAST) against the Bristol contigs. If the search 
identifies a contig and it contains SNPs, then the 
user can display the positions of each SNP within 
that particular contig. 

4) "Contig information"; facilitates the search for 
information about the contig on which a SNP is 
found by providing a SNP ID, or the contig name to 
see whether it contains any SNPs. If a SNP is found, 
the output displays the contig sequence, SNP 
position, information on mapping and any associated 
BLAST annotation (the original BLASTX reports are 
also included). A field is provided to search for 
contigs based on a specific term, for example 'disease 



resistance' or 'cold', and will return all contigs that 
contain this term in the contigs' associated BLASTX 
report. 

The draft assembly of the gene-rich regions of the 
Chinese Spring genome, or the raw sequence reads can 
be searched using BLAST and a drop down field is avail- 
able for setting the e-value cut-off. 

Wheat ESTs can be searched by gene name, similarity 
search (via BLAST) or westdb clonelD. The wheat DArT 
markers can be searched by marker name or wheat line 
and the facility is available to download the DArT data- 
sets in Excel format. Mapped DArT markers can also be 
viewed on their respective chromosome by clicking the 
appropriate chromosome image of the ideogram. 

Finally, there is a Help/FAQ page for users, with 
answers to potential questions about the data in the 
database and other features on the website. 

Conclusions 

Substantial enhancement has been made to the original 
CerealsDB site through the inclusion of databases for 
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SNP and DArT markers as well as other genomic 
resources. In order to make CerealsDB comprehensible to 
non-scientists and to retain its relevance to scientific spe- 
cialists, version 2.0 has improved both functionality and 
the user-friendliness of the web interface. As the SNP 
database grows in size and the number of KASPar vali- 
dated SNPs increases, we expect traffic to the CerealsDB 
site to grow. The website has already attracted consider- 
able interest from the plant-breeding and research com- 
munities and has recorded, on average, more than 5,600 
unique visits per month over the past year (web traffic 
statistics generated by Webalizer version 2.23). Our la- 
boratory is actively developing and validating SNP mar- 
kers, and these are regularly added to the database. 
Future developments planned for the CerealsDB site in- 
clude the linking of SNP data with phenotypic data and 
the integration of the KASPar SNP database with other 
cereal genomics databases. We aim to make CerealsDB 
an important, freely available resource by releasing all 
data into the public domain without restrictions. This will 
ensure that wheat breeders across the world have the data 
they require to create new varieties of wheat that will be 
needed to help feed the world beyond 2050. 

Availability and requirements 

Availability 

CerealsDB can be accessed online from the following 
URL : http://www.cerealsdb.uk.net/CerealsDB/Documents/ 
DOC_CerealsDB.php. The SNP database is publicly and 
freely accessible, requiring no registration and with no 
restrictions on use. 

Technical requirements 

It is recommended that one of the following browsers is 
used: Mozilla Firefox 3 on Linux, Mac OSX or Win- 
dows, Internet Explorer 8 on Windows, Safari 4 on Mac 
OsX or Windows, Chrome on Linux or Windows. 
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